-
Notifications
You must be signed in to change notification settings - Fork 35
fix: Retry on failures writing into MongoDB #670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThreaded configurable maxRetries into the DatabaseStoreConnector and initialization, added retry/drop decision based on ringbuffer.NumRequeues with rate-limited requeue via AddRateLimited, extended ringbuffer with configurable retry/backoff and requeue inspection, and added unit tests and Helm configuration for the new setting. Changes
Sequence DiagramsequenceDiagram
participant SC as Store Connector
participant MongoDB as MongoDB
participant RB as Ring Buffer
participant RL as Rate Limiter
SC->>MongoDB: InsertMany(events)
alt Insert Success
MongoDB-->>SC: Success
SC->>RB: HealthMetricEleProcessingCompleted(events)
else Insert Failure
MongoDB-->>SC: Error
SC->>RB: NumRequeues(events) -> retryCount
alt retryCount < maxRetries
SC->>RB: AddRateLimited(events)
RB->>RL: Enqueue with backoff
else retryCount >= maxRetries
SC->>RB: HealthMetricEleProcessingCompleted(events) (Forget + Done -> drop)
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
platform-connectors/pkg/connectors/store/store_connector_test.go (1)
21-21: Unused sync import.The
sync.Mutexat line 318 is declared, andmu.Lock()/mu.Unlock()are called in theEventuallycallback, butcallCountis only used within that same callback and the mutex doesn't protect any shared state between the test goroutine and theFetchAndProcessHealthMetricgoroutine. The mock client'sCallsslice is accessed, which is already thread-safe via testify's internal synchronization.Consider removing the unused mutex and the
syncimport to simplify the test.♻️ Suggested simplification
import ( "context" "errors" "os" - "sync" "testing" "time"And in the test function (lines 318-319, 351-353):
- var mu sync.Mutex - callCount := 0 ... require.Eventually(t, func() bool { - mu.Lock() - defer mu.Unlock() - callCount = len(mockClient.Calls) - return callCount >= 3 + return len(mockClient.Calls) >= 3 }, 500*time.Millisecond, 10*time.Millisecond, "Should retry and eventually succeed")platform-connectors/pkg/ringbuffer/ring_buffer.go (1)
118-121: Correct workqueue retry pattern.The sequence of
AddRateLimitedfollowed byDoneis the proper pattern for workqueue retries:
AddRateLimitedschedules the item for re-processing after the backoff delayDonemarks the current processing attempt as complete, allowing the queue to process other itemsConsider adding a brief comment explaining this pattern for future maintainers.
📝 Optional: Add clarifying comment
func (rb *RingBuffer) AddRateLimited(data *protos.HealthEvents) { + // Re-queue with exponential backoff delay, then mark current attempt done rb.healthMetricQueue.AddRateLimited(data) rb.healthMetricQueue.Done(data) }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
platform-connectors/pkg/connectors/store/store_connector.goplatform-connectors/pkg/connectors/store/store_connector_test.goplatform-connectors/pkg/ringbuffer/ring_buffer.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
platform-connectors/pkg/connectors/store/store_connector_test.goplatform-connectors/pkg/connectors/store/store_connector.goplatform-connectors/pkg/ringbuffer/ring_buffer.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (4)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (1)
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
platform-connectors/pkg/ringbuffer/ring_buffer.go (2)
NewRingBuffer(55-82)WithRetryConfig(48-53)store-client/pkg/client/interfaces.go (1)
InsertManyResult(95-97)platform-connectors/pkg/connectors/store/store_connector.go (1)
DatabaseStoreConnector(42-48)
🔇 Additional comments (9)
platform-connectors/pkg/connectors/store/store_connector.go (2)
35-40: LGTM! Well-documented constant for retry cap.The constant is appropriately documented, explaining both the purpose (preventing unbounded memory growth) and the resilience benefit. The value of 3 is reasonable for transient failures.
114-136: Retry logic correctly implements bounded retries with exponential backoff.The flow is sound:
- Initial attempt fails →
NumRequeuesreturns 0 → retry- After 3 requeues,
NumRequeuesreturns 3 → dropGood inclusion of contextual information in the error log (node name, check name, event count) to aid debugging dropped events.
platform-connectors/pkg/connectors/store/store_connector_test.go (2)
307-362: Good test coverage for retry success scenario.The test correctly:
- Configures short delays for fast test execution
- Sets up mock to fail twice then succeed
- Verifies 3 total
InsertManycalls- Confirms queue is emptied after success
364-414: Good test coverage for max retry drop scenario.The test correctly verifies that:
- After 4 attempts (initial + 3 retries), the event is dropped
- The queue is emptied to prevent unbounded memory growth
- The expected number of
InsertManycalls matchesMaxRetries + 1One minor observation: consider adding an assertion or comment to make the relationship between
MaxRetries(3) and expected calls (4) more explicit for maintainability.platform-connectors/pkg/ringbuffer/ring_buffer.go (5)
28-33: Good default retry configuration with clear documentation.The timing values are well-chosen:
- 500ms base delay provides quick recovery for brief transient failures
- 3s max delay caps the backoff to avoid excessive waits
- The comment explaining the progression (500ms → 1.5s → 3s) aids understanding
41-53: Clean functional options pattern implementation.This enables configuring shorter delays in tests while maintaining production defaults, aligning well with the PR objective of making retry logic "configurable to avoid slowing unit tests."
55-82: LGTM! Constructor properly integrates configurable retry settings.The implementation correctly:
- Initializes config with production defaults
- Applies any provided options to override
- Uses the configured delays for the exponential backoff rate limiter
108-111: Correct:Forgetclears rate limiter state before marking done.Calling
ForgetbeforeDoneensures the rate limiter's failure count for this item is cleared, preventing stale retry metadata from accumulating.
123-125: LGTM! Correctly exposes requeue count for retry logic.
|
/ok to test ab004a6 |
I believe for the k8s one we might already be using retry one error -- https://github.com/NVIDIA/NVSentinel/blob/main/platform-connectors/pkg/connectors/kubernetes/process_node_events.go#L88 which I believe should handle the retries for us.
Yeah, it is a misnomer when we initially started off, we said we'd do a ring buffer, but we never ended up doing so since we didn't want to drop or overwrite events. If we do make a ring buffer we will likely also have to maintain a WAL like you mentioned so that we can keep the memory growth bounded and flush to disk at some cadence to not loose events. |
🛡️ CodeQL Analysis🚨 Found 1 security alert(s) |
36a2b21 to
582424c
Compare
|
/ok to test 582424c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @distros/kubernetes/nvsentinel/values.yaml:
- Line 115: Add an inline comment for the maxRetries field explaining what it
controls, its default value, acceptable range/units, and its behavior (e.g.,
whether it applies per request or overall retry attempts and how backoff is
handled); update the maxRetries line (the maxRetries key) to include a brief
one-line comment above or to the right that documents purpose, default (3), and
any limits or interaction with backoff/retry logic.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
distros/kubernetes/nvsentinel/templates/configmap.yamldistros/kubernetes/nvsentinel/values.yamlplatform-connectors/main.goplatform-connectors/pkg/connectors/store/store_connector.goplatform-connectors/pkg/connectors/store/store_connector_test.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
platform-connectors/main.goplatform-connectors/pkg/connectors/store/store_connector.goplatform-connectors/pkg/connectors/store/store_connector_test.go
**/values.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/values.yaml
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (1)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (2)
platform-connectors/main.go (2)
platform-connectors/pkg/connectors/store/store_connector.go (2)
DatabaseStoreConnector(35-42)InitializeDatabaseStoreConnector(58-80)platform-connectors/pkg/server/platform_connector_server.go (1)
InitializeAndAttachRingBufferForConnectors(70-72)
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
platform-connectors/pkg/connectors/store/store_connector.go (2)
InitializeDatabaseStoreConnector(58-80)DatabaseStoreConnector(35-42)platform-connectors/pkg/ringbuffer/ring_buffer.go (2)
NewRingBuffer(55-82)WithRetryConfig(48-53)store-client/pkg/client/interfaces.go (1)
InsertManyResult(95-97)
🔇 Additional comments (9)
distros/kubernetes/nvsentinel/templates/configmap.yaml (1)
28-28: LGTM!The config entry is correctly added to the Helm template and properly references the values.yaml field.
platform-connectors/main.go (1)
136-159: LGTM!The config-based initialization properly reads and converts the maxRetries value, following the established patterns in this file for other configuration parameters.
platform-connectors/pkg/connectors/store/store_connector_test.go (3)
298-298: LGTM!The test correctly updated to include the new maxRetries parameter.
305-357: Excellent test coverage for retry-on-failure scenario.The test correctly validates that:
- Failed writes are retried with exponential backoff
- Queue eventually drains after a successful retry
- InsertMany is called the expected number of times
The test setup with mock failures followed by success effectively exercises the retry logic.
359-411: Excellent test coverage for drop-after-max-retries scenario.The test correctly validates the critical behavior that prevents unbounded memory growth:
- Events are dropped after exceeding maxRetries
- Queue drains even when all attempts fail
- InsertMany is called exactly 4 times (1 initial + 3 retries) as expected with maxRetries=3
This test ensures the system won't accumulate events indefinitely during prolonged outages.
platform-connectors/pkg/connectors/store/store_connector.go (4)
41-41: LGTM!The maxRetries field is appropriately added to the connector struct to support configurable retry behavior.
44-56: LGTM!The constructor properly accepts and stores the maxRetries parameter.
58-80: LGTM!The initialization function correctly accepts maxRetries, passes it to the constructor, and includes it in the success log for observability.
111-129: Excellent retry logic implementation with proper safeguards.The implementation correctly:
- Retries failed writes with exponential backoff via rate-limited requeue
- Drops events after exceeding maxRetries to prevent unbounded memory growth
- Provides rich structured logging for both retry and drop scenarios
- Includes relevant event details (node name, check name) in drop logs for troubleshooting
The condition
retryCount < maxRetriesmeans with maxRetries=3, events are attempted up to 4 times total (1 initial + 3 retries), which aligns with the test expectations and is a reasonable interpretation.
|
/ok to test 582424c |
Signed-off-by: Suket Sharma <[email protected]>
Signed-off-by: Suket Sharma <[email protected]>
Signed-off-by: Suket Sharma <[email protected]>
582424c to
f75251e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
distros/kubernetes/nvsentinel/values.yaml (1)
117-117: Add inline comment documenting themaxRetriesconfiguration.Per coding guidelines, all values in Helm chart
values.yamlshould be documented with inline comments. Consider adding a description explaining the purpose, valid range, and behavior (e.g., what happens when set to 0).📝 Suggested documentation
mongodbStore: enabled: false clientCertMountPath: "/etc/ssl/mongo-client" - maxRetries: 3 + # Maximum number of retry attempts for MongoDB write operations on transient failures. + # Set to 0 to disable retries. Total attempts = 1 (initial) + maxRetries. + maxRetries: 3Based on learnings and coding guidelines for
**/values.yaml.platform-connectors/main.go (1)
144-151: Consider validatingmaxRetriesis non-negative.The code correctly reads and converts
StoreConnectorMaxRetriesfrom the config. However, there's no validation that the value is non-negative before passing it downstream. A negative value from misconfiguration could cause unexpected behavior in the retry logic.🛡️ Optional validation
maxRetriesFloat, ok := config["StoreConnectorMaxRetries"].(float64) if !ok { return nil, fmt.Errorf("failed to convert StoreConnectorMaxRetries to float: %v", config["StoreConnectorMaxRetries"]) } maxRetries := int(maxRetriesFloat) + if maxRetries < 0 { + return nil, fmt.Errorf("StoreConnectorMaxRetries must be non-negative, got: %d", maxRetries) + } storeConnector, err := store.InitializeDatabaseStoreConnector(ctx, ringBuffer, databaseClientCertMountPath, maxRetries)platform-connectors/pkg/ringbuffer/ring_buffer.go (1)
113-116: Consider consolidating withHealthMetricEleProcessingCompleted.
HealthMetricEleProcessingFailednow has identical implementation toHealthMetricEleProcessingCompleted. If both are semantically equivalent (item done, no retry), consider consolidating or documenting when to use each.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
distros/kubernetes/nvsentinel/templates/configmap.yamldistros/kubernetes/nvsentinel/values.yamlplatform-connectors/main.goplatform-connectors/pkg/connectors/store/store_connector.goplatform-connectors/pkg/connectors/store/store_connector_test.goplatform-connectors/pkg/ringbuffer/ring_buffer.go
🚧 Files skipped from review as they are similar to previous changes (1)
- distros/kubernetes/nvsentinel/templates/configmap.yaml
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
platform-connectors/main.goplatform-connectors/pkg/ringbuffer/ring_buffer.goplatform-connectors/pkg/connectors/store/store_connector.goplatform-connectors/pkg/connectors/store/store_connector_test.go
**/values.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/values.yaml
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧠 Learnings (2)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments
Applied to files:
distros/kubernetes/nvsentinel/values.yaml
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/connectors/store/store_connector_test.go
🧬 Code graph analysis (2)
platform-connectors/main.go (1)
platform-connectors/pkg/connectors/store/store_connector.go (2)
DatabaseStoreConnector(35-42)InitializeDatabaseStoreConnector(58-80)
platform-connectors/pkg/ringbuffer/ring_buffer.go (1)
node-drainer/pkg/config/config.go (1)
Duration(37-39)
🔇 Additional comments (13)
platform-connectors/main.go (1)
268-268: LGTM!The call site is correctly updated to pass the config map, consistent with the pattern used for
initializeK8sConnector.platform-connectors/pkg/connectors/store/store_connector_test.go (3)
298-303: LGTM!The test correctly passes the new
maxRetriesparameter toInitializeDatabaseStoreConnector.
306-357: Well-structured retry test.The test correctly verifies retry behavior with exponential backoff. The
WithRetryConfigoption with short delays ensures fast test execution. The mock expectations (2 failures, then success) and assertion of 3 total calls correctly validate the retry mechanism.Minor note:
cancel()at line 356 is redundant withdefer cancel()at line 310, but harmless.
359-411: Correct max-retries test coverage.The test properly validates that events are dropped after exceeding
maxRetries. The assertion of 4InsertManycalls (1 initial + 3 retries) correctly reflects the expected behavior whenmaxRetries=3.platform-connectors/pkg/connectors/store/store_connector.go (4)
35-42: LGTM!The
maxRetriesfield addition toDatabaseStoreConnectoris clean and well-placed.
44-56: LGTM!The constructor correctly accepts and assigns the new
maxRetriesparameter.
58-80: LGTM!The initialization function correctly threads
maxRetriesthrough and logs it for operational visibility.
110-132: Retry logic is well-implemented.The implementation correctly:
- Queries retry count before deciding to requeue or drop
- Uses rate-limited requeue for exponential backoff
- Logs sufficient context for debugging (retry count, max retries, event count, first event details on drop)
- Properly marks items completed when dropping to prevent unbounded memory growth
Note: When
maxRetries=0, the conditionretryCount < r.maxRetries(i.e.,0 < 0) is false, so retries are effectively disabled. If this is intentional, consider documenting it in the values.yaml comment.platform-connectors/pkg/ringbuffer/ring_buffer.go (5)
28-33: Good default configuration.The default delays (500ms base, 3s max) provide reasonable production defaults. The comment documenting the retry timing progression is helpful.
41-53: LGTM!Clean functional options pattern with good encapsulation (unexported
configstruct, exportedOptiontype andWithRetryConfigfunction).
55-82: LGTM!The constructor correctly applies the functional options pattern with sensible defaults. The exponential failure rate limiter is appropriate for transient failure handling.
108-111: Correct lifecycle management.Adding
Forget()beforeDone()ensures the rate limiter's failure count is cleared when processing completes successfully. This prevents stale retry counts from affecting future operations.
118-125: LGTM!The
AddRateLimitedmethod correctly sequences operations:
AddRateLimitedre-queues the item with exponential backoff delayDonereleases the processing lock from the earlierGet()
NumRequeuesproperly exposes the workqueue's failure count for retry decisions.
Signed-off-by: Suket Sharma <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
platform-connectors/main.go (1)
144-151: Consider validatingmaxRetriesis positive.The config value is read and converted but not validated. A
maxRetriesof 0 or negative would be accepted, which could cause events to be dropped immediately without retry attempts if misconfigured.🔧 Suggested validation
maxRetries := int(maxRetriesInt64) + if maxRetries < 0 { + return nil, fmt.Errorf("StoreConnectorMaxRetries must be non-negative, got: %d", maxRetries) + } storeConnector, err := store.InitializeDatabaseStoreConnector(ctx, ringBuffer, databaseClientCertMountPath, maxRetries)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
platform-connectors/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
platform-connectors/main.go
🧬 Code graph analysis (1)
platform-connectors/main.go (1)
platform-connectors/pkg/connectors/store/store_connector.go (2)
DatabaseStoreConnector(35-42)InitializeDatabaseStoreConnector(58-80)
🔇 Additional comments (1)
platform-connectors/main.go (1)
267-272: LGTM!The call site correctly passes the
configmap to propagate theStoreConnectorMaxRetriessetting. This follows the same pattern used forinitializeK8sConnector.
Summary
Going through the platform connectors, I noticed that we drop health events if there's a failure writing into MongoDB. I think there's value in retrying across a few seconds when writing to MongoDB before we bail. This ensures that if there's a primary failover happening in the underlying data layer or a brief networking interruption, we're resilient to it. Fwiw, I don't think
ring_buffer.gois an actual implementation of a ring buffer - it seems to be a thin wrapper around workqueues and effectively unbounded so you could still run into memory issues, so that's worth addressing too at a later point.I think there's value in also making the same changes for the K8s connector since that's got the same issue IMO but I'll wait for feedback in this PR before doing that too.
In the long term, I think we should do the following -
I've added in some unit tests, but wasn't able to test this against an actual cluster yet. I'm doing that concurrently but figured I could get feedback in the interim.
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.